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ABSTRACT 

The Saccharomyces Genome Database (SGD, http:// 
www.yeastgenome.org) is the community resource 
for the budding yeast Saccharomyces cerevisiae. 
The SGD project provides the highest-quality 
manually curated information from peer-reviewed 
literature. The experimental results reported in 
the literature are extracted and integrated within a 
well-developed database. These data are combined 
with quality high-throughput results and provided 
through Locus Summary pages, a powerful query 
engine and rich genome browser. The acquisition, 
integration and retrieval of these data allow SGD 
to facilitate experimental design and analysis by 
providing an encyclopedia of the yeast genome, its 
chromosomal features, their functions and inter- 
actions. Public access to these data is provided to 
researchers and educators via web pages designed 
for optimal ease of use. 

INTRODUCTION 

Saccharomyces cerevisiae has been widely utilized in the 
exploration of biochemistry, molecular biology, cell 
biology and systems biology because of the ease with 
which it can be grown and manipulated, the extensive 
conservation of its genes and pathways with those of 
higher organisms, and the powerful genetic techniques 
that it offers. Many new genome-wide technologies — the 
creation of bar-coded systematic deletion sets; large-scale 
detection of protein-protein and genetic interactions and 
subcellular localizations; transcriptome, proteome and 
metabolome analysis — were first developed using yeast 
before being more widely applied to other organisms. 



For these reasons S. cerevisiae research serves as a 
model for a variety of processes that occur in humans, 
many of which are disease-associated. Direct parallels 
can be drawn between the two organisms in areas such 
as lipid metabolism (1), the unfolded protein response 
(2), mitochondrial metabolism (3), prion development 
(4) or ageing (5), to name just a few possible examples. 
The core users of Saccharomyces Genome Database 
(SGD) include those conducting research on 
Saccharomyces species as well as those exploring the 
genetics and cellular biology of other fungal genera that 
are important to basic research, human health and 
industry. Researchers studying larger organisms, including 
models such as Drosophila and Caenorhabditis, as well as 
plants and humans, represent growing communities that 
look to SGD for information when their research leads to 
genes with similarity to one of the many that are already 
well characterized in yeast. Educators and students in 
genetics and cellular biology comprise another large com- 
munity that SGD serves, as do bioinformatics scientists 
who perform genome-wide computational analyses, for 
either yeast or comparative studies. 

In this update the organization and enhancement to the 
SGD resource is presented in three parts: the representa- 
tion and integration of experimental results into a rich 
biological model, new query and display tools that 
unlock discovery of this information, and new social 
media outlets to stay informed of new features and data 
provided by SGD. 

REPRESENTATION OF EXPERIMENTAL RESULTS: 
CAPTURING, MAINTAINING AND INTEGRATING 
HIGH-QUALITY ANNOTATIONS 

We define curation as the comprehensive and accurate 
manual extraction of experimental results from 
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peer-reviewed literature. At SGD, curation has been, and 
will continue to be, a core responsibility that we conduct 
supplying a valued service to biomedical research 
communities. Providing comprehensive information for 
gene products, derived via literature curation, is of 
special value because it aids experimental design, facili- 
tates gene function discovery and enhances gene product 
annotations provided by other genomic resources. 

Structured controlled vocabularies (i.e. ontologies) 
include precise definitions representing conserved 
common biological concepts, thereby ensuring a consist- 
ent interpretation of results being represented. Controlled 
vocabularies facilitate the integration of diverse data types 
and promote computational analysis. SGD uses 
controlled vocabularies to describe experimental results 
selected from the complete literature on budding yeast. 
This approach has allowed us to maintain the gold 
standard of annotations describing the complete represen- 
tation of yeast chromosomal features, their functions and 
interactions. The quality of our literature annotations as a 
whole is addressed regularly with group curation exercises 
to ensure that information is entered into SGD accurately 
and consistently. In addition, we are actively researching 
and developing new methods for quality control of Gene 
Ontology (GO) functional annotations (6). These efforts 
in ensuring consistency and quality, coupled with a quick 
review of new papers after publication, ensure the gold 
standard annotations are in pace with scientific progress, 
as defined by experimental results. 

Gene Ontology (GO) annotations 

Manual literature curation provides the highest quality 
representation of published experimental results. At 
SGD, annotation using GO terms captures biological in- 
formation regarding specific gene products in a search- 
able, computable form that can be readily compared 
across organisms. Manual GO annotations for specific 
gene products are chosen carefully within the context of 
all available biological knowledge, and represent the best 
possible summary of the most relevant information. The 
annotations for each gene product are reviewed as a 
whole, and updated as new information becomes avail- 
able, a time- and resource-intensive process. 

GO annotations may also be assigned via computation- 
al methods based on criteria such as protein domains, 
HMMs, or patterns of experimental data, allowing the 
rapid annotation of multiple genes using consistent par- 
ameters. Limitations of computational predictions do 
exist: such as incomplete genomic coverage; the use of 
too-general GO terms; and the inherent biases of different 
methods (6). In 2007, SGD began including computation- 
al predictions from the Gene Ontology Annotation 
(GOA) project at UniProtKB (7,8), whose InterPro tools 
rely in part on the presence of experimentally 
characterized protein domains with manually curated con- 
nections to GO terms. 

Researchers often transfer GO functional annotations 
between homologous proteins, thus the accuracy of SGD's 
literature annotations is of utmost importance (9,10). In 
order to identify efficient methods to keep GO 



annotations comprehensive and current, we explored 
whether the large set of computational predictions could 
be leveraged to find inaccuracies or omissions in our 
manual literature-based GO annotations. In an initial 
feasibility study (6), we examined a subset of genes that 
had a manual GO annotation that was less specific than 
the computational prediction. Specifically we selected 
genes that lacked published functional information 
(manual 'unknown' annotations) but had InterPro com- 
putational predictions. Review of the publications for 
these genes found that a significant proportion of the 
manual annotations could be updated to a defined, experi- 
mentally supported function. In contrast, review of the 
literature for a comparable control set of genes (those 
with manual 'unknown' annotations, but without corres- 
ponding InterPro computational predictions) did not 
identify as many possible improvements. Thus, we have 
demonstrated that the presence of InterPro computational 
predictions can effectively identify annotations that need 
to be updated. 

We expanded this strategy so that each literature-based 
annotation for a gene was compared to computation- 
al predictions generated by different methods in 
each of the three GO aspects (Molecular Function, 
Biological Process, and Cellular Component) (Park, J., 
Costanzo,M.C, Balakrishnan,R., CherryJ.M. and 
Hong, E.L. CvManGO, a method for leveraging compu- 
tational predictions to improve literature-based GO 
annotations, submitted to Database for Biocuration 2012 
virtual issue.). The comparison between literature-based 
annotations and computational predictions identified 
two types of literature-based annotations that needed 
review: those that were less specific than the computation- 
al predictions and those that do not have the same lineage 
to the root term as the computational predictions. Our 
goal is to develop an automated method to identify and 
prioritize genes whose annotations may need to be 
re-examined based on the number of literature-based an- 
notations that need review as well as the distance between 
the GO terms used by the literature-based and computa- 
tional annotations, the number of computational predic- 
tions that suggest a gene should be reviewed, and the date 
the gene was last reviewed. This strategy will be applicable 
to GO annotations for any organism for which both 
literature-based and computational annotation is 
performed. 

Phenotypes 

Complete descriptions of mutant phenotypes are also rep- 
resented using a controlled vocabulary known as the 
Ascomycete Phenotype Ontology (APO; available from 
the OBO Foundry, http://www.obofoundry.org). Each 
phenotype annotation in SGD is represented as an observ- 
able, which describes the entity or process, paired with a 
qualifier that describes the change in that entity or process. 
For example, if a mutant strain has increased sensitivity to 
a chemical compound, relative to wild-type, in which the 
observable is captured as 'resistance to chemicals' and the 
qualifier is 'decreased'. Chemical compounds, which are 
often used to elicit phenotypes, are specified using a 
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controlled vocabulary called the ChEBI ontology (11). 
Phenotypes representing over 17 000 classical and 
high-throughput phenotype assays have been described 
and are available from SGD (12). 

Biochemical pathways 

Biochemical pathways are manually curated by SGD and 
provided using the Pathway Tools browser version 15.0 
(13). The SGD biochemical pathways data set for 
S. cerevisiae, one of the most highly curated data sets 
among all Pathway Tools data sets available, is the gold 
standard for budding yeast; SGD supports an ongoing 
effort to update and enhance these data. The Pathway 
Tools interface provides a complete description of each 
pathway, with molecular structures, E.C. numbers and 
full reference listing. The updated pathways browser 
provides several enhanced features, including download 
of a list of genes found in a pathway for further analysis 
with other tools available at SGD. The pathway browser 
is hyperlinked via the 'Pathways' section of the Locus 
Summary page. The Pathway display is available from 
http://pathway.yeastgenome.org. 

Nomenclature 

SGD continues to maintain the S. cerevisiae genomic no- 
menclature. Our job is to promote the community-defined 
nomenclature standards and to ensure that the 
agreed-upon guidelines are followed in naming new 
genes or assigning new names to previously identified 
genes. Community guidelines state that the first published 
name for a gene becomes the standard name. However, 
prior to publication, a gene name may be registered and 
displayed in SGD in order to notify the community of its 
intended use. If there are disagreements or naming con- 
flicts, we communicate with the relevant researchers within 
the community and negotiate an agreement whenever 
possible. The majority of those working on the gene in 
question must agree to any nomenclature change before 
it is implemented in SGD. In addition to maintaining 
genetic names, SGD ensures that the names of ORFs, 
ARS elements, tRNAs and other chromosomal features 
also conform to agreed-upon formats. Over the past 
2 years 154 new gene names have been assigned and 21 
community-initiated name changes have been processed. 

Maintenance of the S288C reference genome 

SGD maintains, updates and distributes the 5. cerevisiae 
reference genome sequence of the commonly used lab 
strain S288C. The original reference sequence, released 
in April 1996 (14), was the first available eukaryotic 
genome sequence and has been maintained by SGD 
ever since (S. cerevisiae S288C reference genome 
sequence NCBI accession numbers: NC_001133, 
NC_001134, NC_001135, NC_001136, NC_001137, 
NC_001138, NC_001139, NC_001140, NC_001141, 
NC_001142, NC_001143, NC_001144, NC_001145, 
NC_001146, NC_001147 and NC_001148). Between its 
original release and 2010, there have been a total of 239 
changes to the sixteen nuclear chromosomes. In 2010, a 
newly determined reference sequence was released, from 



strain AB972, a direct descendant of S288C and one of the 
strains used in the original 1996 published sequence (14). 
The new reference sequence (version 64) was determined 
by mapping next-generation sequence reads to the 
previous reference (version 63) then manually inspecting 
any disagreements between the new sequence and the old 
reference. The AB972 sequence was released in February 
2011 as the new reference genome for S. cerevisiae strain 
S288C, providing even higher quality sequence from a 
single consistent strain background for all 16 nuclear 
chromosomes. Concurrent with the update of the refer- 
ence sequence we have adopted a reference versioning 
system; the current version of the reference sequence 
with chromosomal feature annotation was released as 
version 64.1. The first part of the version number incre- 
ments when the sequence changes and second part incre- 
ments when there are changes to the gene models. The 
current and all previous versions can be downloaded 
from SGD (http://downloads.yeastgenome.org/sequence/ 
S288C_reference/genome_releases). With this comprehen- 
sive update in place, we plan that future differences 
between the reference and sequenced genes will primarily 
be represented as variations from the reference. However, 
a reference genome must take into consideration the best 
case for that strain and not simply the sequence of one 
individual; therefore, we will accommodate corrections to 
non-functional alleles unique to S288C. We have confi- 
dence that in the near future, the genomes of additional 
isolates of AB972 and other direct derivatives of S288C 
will be determined. Differences between these genomes 
will also be provided as alleles or observed variation of 
the current reference sequence. 

Continued annotation and analysis of the S. cerevisiae 
genome 

Our understanding of the 5". cerevisiae genome is no longer 
limited to a few commonly studied strains such as S288C 
(genomics) (15), W303 (cell cycle and budding) (16) and 
SKI (sporulation and meiosis) (17). Nextgen sequencing 
methods have already provided the genomic sequences of 
tens of S. cerevisiae laboratory and industrial strains 
(http://downloads.yeastgenome.org/sequence/strains), 
and will soon provide the genomic sequences for thou- 
sands of wild isolates. As with other model organisms, 
comparative genomics of many isolates will provide a 
new understanding of the full genetic constituent parts 
of a species. We are compiling 5. cerevisiae genomes avail- 
able from NCBI, selecting genomes based on scientific or 
industrial relevance and sequence quality. We currently 
provide 26 strain genomes available for download and 
include them in our BLAST and pattern matching 
(PatMatch) search tools. In addition, homologs to 
protein-coding genes present in the reference strain have 
been automatically predicted for 18 genomes. The pre- 
dicted ORF sequence as well as a ClustalW alignment of 
all predicted ORFs in all strains is available from the 
Locus Summary page for each protein-coding ORF. 

We are developing a pan-genomic representation of 
5. cerevisiae that includes all protein encoding genes, 
and their variation, found within any strain of 
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S. cerevisiae. SGD has long provided so called 'not in 
S288C genes beginning with those that were genetically 
defined. We are now expanding this to include genes that 
have been observed within the genome of any 5. cerevisiae 
stain. Genes can be lost as isolated populations adapt to 
their environment, and the pan-genomic representation of 
S. cerevisiae provides ready access to these genes, in 
context. For example, the strain S288C, source of the ref- 
erence sequence, is known to be missing several well 
studied genes involved in sugar utilization such as SUC1 
(encoding one of several invertases, sucrose hydrolyzing 
enzymes), MEL1 (encoding an alpha-galactosidase 
required for catabolic conversion of melibiose to 
glucose) and BIOl (encoding a pimeloyl-CoA synthetase) 
[SUC family (18); MEL family (19); BIO family (20)]. In 
addition to the reference genome there are currently 
twelve other genomes with annotations at GenBank and 
UniProt. The growth environment of commercial yeast 
strains, grape juice, is harsher than that faced by labora- 
tory strains. The difference in growth environment is likely 
represented by the different repertoire of gene functions. 
The commercial wine yeast ECU 18 (21) lacks 111 genes 
present in S288C and contains 34 that are not found in 
S288C. These 34 genes of ECU 18 not present in 
S288C include predicted hexose transporters, drug 
transporters, a putative zinc-finger transcription factor, a 
methyltransferase, an oxidoreductase and nucleotide 
transporters. The genomic sequence from another wine 
strain, AMRI1631 (22), is reported to contain several 
other genes not found in S288C. The addition of these 
non-S288C genes to SGD will expand the catalog of func- 
tions that can be explored by yeast genetics. 

As we add additional 'not in S288C genes, we will 
adopt a standard gene categorization for all genes in the 
pan-genome. Genes will be classified into one of three 
types: 'core' genes that are found in all strains, 
'frequent' genes indicating the gene is only found in 
some strains, and 'unique' genes for those that are only 
observed in one stain. Because many of the genomes have 
not been finished, their GenBank records may contain 
partial coding regions, representing gene fragments or 
other ambiguous annotations. Thus many of these 
genomes will require extensive reannotation. SGD will 
not add a gene to the pan-genome unless it meets very 
stringent annotation standards. Because of these issues 
we cannot appropriately estimate the number of core 
and frequently occurring genes within the species. 
However, based on the lack of sequence similarity 
between genes in an annotated strain and S288C, we 
roughly estimate that 1300 genes could be considered in 
the 'unique' class of genes. Therefore, the creation of the 
pan-genome will not be a trivial task but it is reasonable 
that we will be successful. 

Incorporation of high-throughput data sets 

Expression analysis at SGD has a new powerful interface 
and many new data sets. The new interface uses a tool 
called SPELL [Serial Pattern of Expression Levels 
Locator, http://spell.yeastgenome.org (23)], which facili- 
tates the rapid identification of the most informative data 



sets and co-expressed genes based on patterns of expres- 
sion shared with the query gene(s) from a comprehensive 
collection of almost 400 data sets. The expression analysis 
tool can be accessed from the Locus Summary via the 
Expression tab and the Expression Summary histogram. 
Implementation of a new data pipeline and the SPELL 
interface was created through collaboration with the 
SGD Colony at Princeton University. 

Budding yeast has long been used as the initial system 
for developing new high-throughput methodologies; as a 
result, we have a wealth of data types to integrate with 
specific gene products. SGD uses the GBrowse genome 
browser (24) to display diverse types of genomic informa- 
tion, with our most recent efforts focusing on chromatin 
structure, including histone modifications, histone 
variants and nucleosome organization; chromosomal 
maintenance sites for meiotic recombination and origins 
of replication; transcription regulation, including RNA 
Polymerase II occupancy and transcription factor 
binding from ChlP-chip and ChlP-seq assays; and the 
transcriptome, including RNA-seq information and 
3'-UTR analyses. SGD's genome browser is provided 
from http://browse.yeastgenome.org. New high- 
throughput results are frequently added to the browser, 
and all data sets are available for download in standard 
formats (currently 60 published data sets) from the SGD 
file download site, http://downloads.yeastgenome.org 
(Chan,E.T. and Cherry, J. M. Considerations for creating 
and annotating the budding yeast Genome Map at SGD: 
A progress report, submitted to Database for Biocuration 
2012 virtual issue). 

NEW QUERY AND DISPLAY TOOLS: ACCESS TO 
INTEGRATED BUDDING YEAST INFORMATION 

All literature-based manually curated annotations, 
sequence and high-throughput data sets are accessible to 
users via multiple access points and methods. Data are 
integrated for display on SGD's locus-specific pages and 
can be analyzed using web-based tools, or downloaded for 
custom use. 

Downloading gene-specific and data set-specific 
annotations 

Enhancements to SGD's locus specific pages now allow 
download of detailed information. For example, the 
phenotype data is available by selecting the 'Download 
Data' button available from the Phenotype tab of the 
Locus pages. A tab-delimited file of all interaction data 
for a gene can be obtained with the 'Download Unfiltered 
Data' or via the 'Download options' link, which passes the 
list of genes to YeastMine, where the researchers can 
explore SGD's rich collection of associated data, including 
genetic and physical interactions. 

Advanced query 

YeastMine (http://yeastmine.yeastgenome.org; Balakrishnan, 
R., Park, J., Karra,K., Hitz,B.C, Binkley,G., Hong,E.L., 
Sullivan,!, Micklem,G. and Cherry ,J.M. YeastMine- An 
integrated data warehouse of S. cerevisiae data as a 
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multi-purpose tool-kit at SGD. submitted to Database for 
Biocuration 2012 virtual issue) is a multifaceted search and 
retrieval tool that provides access to diverse types of 
curated and systematic data. Searches can be initiated 
with a list of genes, a list of GO terms, or results from 
multiple searches of a variety of data types. At each step, 
the results can be combined for further analysis. Queries 
can be customized by modifying predefined templates, or 
by creating completely new searches via the QueryBuilder. 
Customized queries facilitate access and retrieval of 
specific types of data, without requiring additional 
programming support at SGD. The search and its 
results can be saved or downloaded in customizable file 
formats. YeastMine can currently query the following 
data types: GO annotations, chromosomal features, co- 
ordinates, phenotype annotations, interaction data, 
orthologs, gene expression and curated literature. We 
are automating the update of YeastMine from the 
central SGD databases on a weekly basis. YeastMine 
will continue to expand with more data types, such as 
high-throughput chromosome data sets, biochemical 
pathways and enhanced data presentation. YeastMine 
was developed using the Intermine environment (25) and 
is being employed by other model organism databases 
and the modENCODE project (26) for Drosophila and 
Caenorhabditis genomic data. YeastMine has many other 
features such as Web Services and customized widgets that 
allow enhanced data display and retrieval. Through the 
integration of new high-throughput data and nextgen 
sequencing results, implementation of data models and 
advanced search via YeastMine, and enhancement of 
data presentation with Gbrowse, SGD has become the 
modENCODE-like hub for yeast related data 
(Chan,E.T. and CherryJ.M. Considerations for creating 
and annotating the budding yeast Genome Map at SGD: 
A progress report, submitted to Database for Biocuration 
2012 virtual issue). 

Literature search tools 

The Textpresso [http://textpresso.yeastgenome.org (27)] 
search tool is available to search approximately 50000 
papers collected by the SGD project. SGD Textpresso 
provides access to the full-text and abstracts of papers 
that have been retrieved for potential curation. Although 
not all such papers were found to include sufficient 
yeast-focused information to be included within SGD, 
these papers may still be generally useful and are thus 
included in the Textpresso catalog. While full-text search 
is available from many other sources, Textpresso provides 
the unique ability to quickly and effectively explore a lit- 
erature collection using very specific, complex queries, 
such as finding all sentences containing a description of 
two interacting genes. 

KEEPING CURRENT WITH CHANGES AT SGD: 
ANNOUNCEMENTS AND SOCIAL MEDIA 

News announcements 

SGD distributes a quarterly newsletter via our web site 
and direct email. The newsletter focuses on new data 



and tool development at SGD. We also provide significant 
community news such as future scientific meetings and 
new laboratory resources. Blog-style news announcements 
will be available on the SGD home page providing news 
noteworthy for the fungal genetics researcher. We are 
updating our software to allow reporting of news about 
yeast genomics, notable awards to community members, 
publication of highly significant results and new methods. 
All posts are also distributed via RSS feed. 

Social media 

SGD is now on Facebook and Twitter. Users can 'Like' us 
on Facebook or follow @yeastgenome on Twitter to 
receive updates about new features, tips on using SGD, 
real-time meeting updates and other interesting tidbits 
about yeast. 
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